Abstract: The web of data consists of data published on the web in such a way that they can be 
interpreted and connected together. It is thus critical to establish links between these data, both for 
the web of data and for the semantic web that it contributes to feed. We consider here the various 
techniques developed for that purpose and analyze their commonalities and differences. We propose 
a general framework and show how the diverse techniques fit in the framework. From this framework 
we consider the relation between data interlinking and ontology matching. Although they can be
considered similar at a certain level (they both relate formal entities), they serve different purposes,
but would mutually benefit from collaborating. We thus present a scheme under which it is possible
for data linking tools to take advantage of ontology alignments.

Key-words: Semantic web, data interlinking, instance matching, ontology alignment, web of data 
MeLinDa:
a framework for linking web data

Résumé (translated): The web of data consists of publishing data on the web in such a way that they
can be interpreted and connected together. It is thus vital to establish links between these data,
both for the web of data and for the semantic web that it contributes to feed. We consider
the various techniques proposed for that purpose and analyze their similarities and differences. We
propose a general framework in which the different techniques fit and we show
how they fit into it. This framework makes it possible to consider the relations between data linking and
ontology alignment: although these activities can be considered similar (they both
find relations between entities), they do not have the same goal but would benefit from collaborating.
We propose an architecture allowing linking tools to take advantage of ontology
alignments.

Keywords: Semantic web, data linking, ontology alignment, web of data



MeLinDa: an interlinking framework for the web of data



1 Introduction 

The web of data is the network resulting from publishing structured data sources in RDF and inter-
linking these data sources with explicit links. A large quantity of structured data is being published,
particularly through the Linking Open Data project¹. Web data sets are expressed according to one or
more vocabularies or ontologies, which range from simple database schema exposure to full-fledged
ontologies.

The web of data requires interlinking the various published data sources. Given the large amount
of published data, it is necessary to provide means to automatically link those data. Many tools were
recently proposed in order to solve this problem, each having its own characteristics (see Section 4).

In many cases, data sets containing similar resources are published using different ontologies. 
Hence, data interlinking tools need to reconcile these ontologies before finding the links between 
entities. This could be done automatically, but more often it is done manually and built into the link
specifications. This has two drawbacks: (a) it prevents reusing the work done in ontology matching
for reconciling ontologies, and (b) the information about reconciling the ontologies is mixed with the
information about how to identify entities.

Hence, the goal of this work is to analyse existing interlinking tools and to determine (1) how
they fit in the same framework, (2) if it is possible to define a language for specifying the linking
techniques to be used, and (3) how data interlinking is related to ontology matching. The contributions
of this report are as follows:

- A comprehensive survey of existing data interlinking tools, 

- A characterization of task/problem categories for web data set interlinking, 

- A proposal for improving data interlinking tools with ontology alignments. 

For that purpose, after briefly introducing the challenges of data interlinking and ontology match-
ing (Section 2), we provide a general framework for data interlinking in which all these tools can
be included (Section 3). From this analysis, we review six data interlinking tools and the way they
are built (Section 4). This framework clearly separates the data interlinking and ontology matching
activities and we show how these can collaborate through three different languages for links, data
linking specifications and ontology alignments (Section 5). We provide examples of an expressive
alignment language (Section 6) and a linking specification language (Section 7). Finally, we show
how these two languages can be adapted for cooperating (Section 8).

2 Web of data, data interlinking, and ontology alignment 

We briefly introduce linked data and the data interlinking problem. We provide examples of this 
problem and why it would require specific linking tools. We then present why these tools could take 
advantage of ontology matching and alignments. 

2.1 Linked data 

The web of data is based on the following four principles [Berners-Lee, 2009; Heath and Bizer, 2011]:

1. Resources are identified by URIs.

2. URIs are dereferenceable. 

3. When a URI is dereferenced, a description of the identified resource should be returned, ideally 
adapted through content negotiation. 

4. Published web data sets must contain links to other web data sets. 



¹ http://esw.w3.org/topic/SweoIG/TaskForces/CommunityProjects/LinkingOpenData



RR n° 7691 






As long as they follow these rules, linked data can be published in various ways (RDF data sets,
SPARQL endpoints, XHTML+RDFa pages [Adida et al., 2008], databases exposed through HTTP
[Bizer, 2003; Sahoo et al., 2009]). Web data sets can also be constructed collaboratively, through the
use of specialized tools [Völkel et al., 2006].

2.2 The data interlinking problem and linksets 

A main problem on the web of data is to create links between entities of different data sets. Most 
often, this consists of identifying the same entity across different data sets and publishing a link 
between them as an owl:sameAs statement (shortened as sameAs hereafter). We call this task data
interlinking and summarize it in Figure 1.

    URI1 --owl:sameAs--> URI2

Figure 1: The data interlinking problem.

Once identified, the links discovered between two data sets must also be published in order to be 
reused. The VoiD vocabulary [Alexander et al., 2009] allows for describing linksets as special data
sets containing links between resources of two given data sets. A linkset is represented as an RDF 
named graph described using VoiD annotations, as shown in the RDF/N3 code below: 

{
  <http://www.example.org/linkset/DBPedia-MB>
      a void:Linkset ;
      void:target <http://www.dbpedia.org> ;
      void:target <http://www.musicbrainz.org> .
}

<http://www.example.org/linkset/DBPedia-MB>
{
  <http://www.dbpedia.org/resource/Johann_Sebastian_Bach>
      owl:sameAs
      <http://www.musicbrainz.org/artist/24f1766e-9635-4d58-a4d4-9413f9f98a4c> .
}

Once linksets are constructed, two approaches are proposed to retrieve equivalences between
resources. The first assigns to each real-world entity a global identifier that is then
related to every URI describing this entity. This is the approach taken in the OKKAM project
[Bouquet et al., 2008], which proposes the use of Entity Name Servers taking the role of resource
name repositories. The other approach uses equivalence lists maintained with interlinked resources
across data sets. There is thus no global identifier in this approach, but equivalence links can be
followed using a third-party web service, e.g., http://sameas.org, or a bilateral protocol
[Volz et al., 2009].
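The equivalence-list approach can be illustrated with a minimal sketch. It is not taken from sameas.org or from any cited system; it only assumes that a linkset is available as a list of URI pairs, and that sameAs links are treated as symmetric and transitive. The function name equivalence_classes is hypothetical.

```python
from collections import defaultdict


def equivalence_classes(same_as_pairs):
    """Group URIs connected by sameAs links into equivalence classes (union-find)."""
    parent = {}

    def find(uri):
        parent.setdefault(uri, uri)
        # Walk up to the class representative, halving the path on the way.
        while parent[uri] != uri:
            parent[uri] = parent[parent[uri]]
            uri = parent[uri]
        return uri

    for left, right in same_as_pairs:
        root_left, root_right = find(left), find(right)
        if root_left != root_right:
            parent[root_left] = root_right

    classes = defaultdict(set)
    for uri in parent:
        classes[find(uri)].add(uri)
    return list(classes.values())


# Illustrative links between three of the Bach URIs used in this report.
links = [
    ("http://dbpedia.org/resource/Johann_Sebastian_Bach",
     "http://musicbrainz.org/artist/24f1766e-9635-4d58-a4d4-9413f9f98a4c"),
    ("http://www.last.fm/music/Johann+Sebastian+Bach",
     "http://dbpedia.org/resource/Johann_Sebastian_Bach"),
]
bach_classes = equivalence_classes(links)
```

With such a structure, no URI plays the role of a global identifier: any member of a class can be used to retrieve the others.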

The data interlinking task can be achieved manually or with the help of data interlinking tools. 
These tools take as input two data sets and ultimately provide a linkset. In addition, they use what 
we call a linking specification, i.e., a "script" specifying how and/or what to link. Indeed, given 
data set sizes, the search space for resource interlinking can reach several billion resources². It is
thus necessary to use heuristics giving hints to the interlinking system about where to look for the
corresponding resources in the two data sets. These linking specifications can be specific to a pair of
data sets and can be reused for regenerating linksets (we provide an example of such a specification
in the Silk language in Section 7).

² 4.2 billion RDF triples related by 142 million links: source Wikipedia, May 2009.

2.3 Interlinking data sets 

Mining for similar resources in two web data sets raises many problems. Each data set having its 
own namespace, resources in different data sets are given different URIs. Also, although naming 
conventions exist [Sauermann and Cyganiak, 2008], there is no formal or standard way of naming
resources. For example, if we take the URI for the famous musician Johann Sebastian Bach in various
web data sets, we obtain different results though they all represent the same real-world object (Table 1).



Dataset | URI
MusicBrainz | http://musicbrainz.org/artist/24f1766e-9635-4d58-a4d4-9413f9f98a4c
LastFM | http://www.last.fm/music/Johann+Sebastian+Bach
DBpedia | http://dbpedia.org/resource/Johann_Sebastian_Bach
OpenCyc | http://sw.opencyc.org/concept/Mx4rwJw6npwpEbGdrcN5Y29ycA

Table 1: Varying URIs across different data sets.

As this example demonstrates, URIs are different across data sets, both because of their names- 
paces and because of their fragments. Fragments are generated according to two strategies: an internal 
ID as for MusicBrainz and OpenCyc, or the concatenation of some of the resource properties, as for 
LastFM and DBpedia. When the first strategy is used, an interlinking system might not be able to 
find correspondences between two resources by looking at URIs only. 

Fortunately, dereferencing URIs can be used for retrieving more information about entities: prop- 
erty values and related resources can be observed. But for the same real-world entity, the same prop- 
erty can take different values, making the interlinking process more difficult. This can be because 
of varying value approximations across data sets, because of different units of measure, because of 
mistakes in the data sets, or because of loose ontological specifications. For instance, the property 
foaf:name does not specify in what format the name should be given. "J.S. Bach", "Bach, J.S." or
"Johann Sebastian Bach" are possible values for this property. Hence, data interlinking tools have
to compare property values in order to decide whether two entities are the same, and must be linked, or
not. For that purpose, tools use similarity measures based on the type of values, e.g., strings, num-
bers, dates, and aggregate the results of these measures. This activity is reminiscent of record linkage
in databases, which has been given considerable attention [Fellegi and Sunter, 1969; Winkler, 2006;
Elmagarmid et al., 2007]. The tools studied in Section 4 reuse many of the record linkage techniques.
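The name variants above can be reconciled before any similarity measure is applied. The following is a minimal sketch of such a normalization, not a technique attributed to any of the tools studied; the function name_key is hypothetical and assumes names are written either as "Given ... Family" or as "Family, Given ...".

```python
import re


def name_key(name):
    """Reduce a personal name to (family name, initials of the given names)."""
    name = name.strip()
    if "," in name:
        # "Family, Given ..." ordering.
        family, given = (part.strip() for part in name.split(",", 1))
    else:
        # "Given ... Family" ordering; detach initials glued to the family name.
        tokens = name.replace(".", ". ").split()
        family, given = tokens[-1], " ".join(tokens[:-1])
    # Split given names on dots, whitespace and hyphens, keep first letters.
    given_tokens = [t for t in re.split(r"[.\s-]+", given) if t]
    initials = "".join(t[0].lower() for t in given_tokens)
    return family.lower(), initials


variants = ["J.S. Bach", "Bach, J.S.", "Johann Sebastian Bach"]
keys = {name_key(v) for v in variants}
```

Reducing given names to initials is deliberately coarse: it merges the three variants, but it would also merge distinct people sharing a family name and initials, so such a key is better used to select candidates than to assert links.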

Another problem is caused by the usage of heterogeneous ontologies for describing data sets. 
In this case, the same resource is typed according to different classes and described with different
predicates belonging to different ontologies. For example, a name in a data set can be attributed using
the foaf:name data property from the FOAF ontology while it is attributed using the vcard:N object
property from the VCard ontology in another data set.

Hence, for the interlinking techniques to work, it is necessary that the data sets use the same 
ontology or that data interlinking tools are aware of the correspondences between ontologies. 



2.4 Ontology matching and alignment 



Ontology matching allows for finding correspondences between ontology entities [Euzenat and Shvaiko, 2007].
The result of this process is called an ontology alignment. Once the ontologies are matched, the align-
ment can be stored and retrieved when an application needs to use data described according to another
ontology [Euzenat et al., 2007a].

Matching ontologies requires overcoming the mismatches originating from the different concep-
tualizations of the domains described by ontologies [Visser et al., 1997; Klein, 2001]. These mis-
matches may be of different natures: terminological mismatches concern differences of naming such
as the usage of synonym terms for concept labeling; conceptual mismatches concern different concep-
tualizations of the domain such as structuring along different properties; structural mismatches con-
cern heterogeneous structures, like different granularities in the class hierarchies. Ontology match-
ing is similar to database schema matching [Rahm and Bernstein, 2001]. Specific works on ontol-
ogy matching were proposed in the last ten years [Noy and Musen, 2000] that now reach maturity
[Euzenat and Shvaiko, 2007]. It is not the purpose of this paper to describe any particular technique.

While different URI constructions and variations of property values can find automatic solutions,
the problem of having heterogeneous ontologies is in most interlinking tools solved by manually
specifying the correspondences (see Table 2, Section 4). This considerably complicates the inter-
linking process. Ontology matching techniques can be used to facilitate the interlinking task and
ontology alignments reused in linking specifications.

The goal of this paper is to investigate the relationships between data interlinking and ontology 
matching. In particular, we want to understand if these two activities should be merged into a single 
activity and share the same formats or if there are good reasons to keep them separated. In the second 
perspective, we also want to establish how they can benefit from one another. For that purpose, we 
analyzed available systems for data interlinking. 



3 A framework for data interlinking 

We provide, in this section, a general framework encompassing the various approaches used to inter- 
link resources on the web of data. 

In the most general case, illustrated in Figure 1, two web data sets are interlinked using a method
for comparing their resources. We do not specify at this stage if the method should be automatic or
manual. Neither do we specify if the two data sets are described using a common ontology or if the
ontologies describing their resources differ.

The following subsections consider these questions. We first consider each case that may
happen when interlinking data and describe them abstractly and through an example. In the end, we
unify all these cases in a common framework.

3.1 Manual interlinking 

In the first case, illustrated in Figure 2, resources are manually interlinked.

    http://musicbrainz.org/artist/24f1766e-9635-4d58-a4d4-9413f9f98a4c
        --owl:sameAs (manual observation)-->
    http://dbpedia.org/resource/Johann_Sebastian_Bach

Figure 2: Example of manually linked resources.
Manually linking resources can be performed using collaborative tools in the case of large data
sets.

3.2 URI correspondence

In some cases, illustrated in Figure 3, resources can be trivially linked using a simple transformation
of their URIs.

    URI1 --owl:sameAs (URI transformation)--> URI2

Figure 3: Data interlinking through URI transformation.

A set of rules can be defined to identify equivalent resources from their identifier. For example, in
the data set LastFM³, the URI representing an artist is built on the pattern "First_name+Last_name".
Person URIs in DBpedia⁴ are built around the pattern "FirstName_LastName". A trivial algorithm
can be developed to find equivalent artists based on their URIs. This is illustrated in Figure 4 for the
composer J. S. Bach.

    http://www.lastfm.fr/music/Johann+Sebastian+Bach
        --owl:sameAs (URI alignment)-->
    http://dbpedia.org/resource/Johann_Sebastian_Bach

Figure 4: Example of resource linking using the correspondence between URIs.
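The URI transformation described above can be sketched in a few lines. This is only an illustration under the stated patterns ("+" separated names under /music/ for LastFM, "_" separated names under /resource/ for DBpedia); the function name lastfm_to_dbpedia is hypothetical, and the result is a candidate URI that would still have to be dereferenced before a link is published.

```python
from urllib.parse import urlsplit, unquote

DBPEDIA_RESOURCE = "http://dbpedia.org/resource/"


def lastfm_to_dbpedia(lastfm_uri):
    """Rewrite a LastFM artist URI into a candidate DBpedia URI.

    Returns None when the URI does not follow the /music/<name> pattern.
    """
    path = urlsplit(lastfm_uri).path
    prefix = "/music/"
    if not path.startswith(prefix):
        return None
    artist = unquote(path[len(prefix):]).strip("/")
    if not artist or "/" in artist:
        return None  # deeper paths (e.g. albums or tracks), not an artist
    return DBPEDIA_RESOURCE + artist.replace("+", "_")


candidate = lastfm_to_dbpedia("http://www.lastfm.fr/music/Johann+Sebastian+Bach")
```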



3.3 Datasets sharing the same ontologies 

Beyond URIs, it may be necessary to consider the ontologies in order to identify entities. In a first
case, illustrated in Figure 5, the two data sets to interlink are described by the same ontology. The role
of the interlinking system is to analyze resources of the same type in order to detect the equivalent
ones. To do this, the system compares resource properties with a similarity measure. Systems in this
category take as input the properties to compare, the type of comparison algorithm to use for each
property, and the method to aggregate the similarity measures of the various properties in order to
construct a measure between two resources.
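The three inputs listed above (properties, per-property comparison, aggregation method) can be made concrete with a small sketch. It does not reproduce any of the tools studied: difflib's ratio stands in for a string measure, the weighted average is one possible aggregation, and all names (string_similarity, resource_similarity, the property keys and weights) are assumptions.

```python
from difflib import SequenceMatcher


def string_similarity(a, b):
    """Stand-in string measure in [0, 1] (difflib ratio, case-insensitive)."""
    return SequenceMatcher(None, a.lower(), b.lower()).ratio()


def resource_similarity(res1, res2, comparators):
    """Weighted average of per-property similarities.

    comparators maps a property name to (measure, weight); a property
    missing from either resource contributes a similarity of 0.
    """
    total_weight = sum(weight for _, weight in comparators.values())
    score = 0.0
    for prop, (measure, weight) in comparators.items():
        if prop in res1 and prop in res2:
            score += weight * measure(res1[prop], res2[prop])
    return score / total_weight


# Properties to compare, the measure for each, and their weights.
comparators = {
    "firstName": (string_similarity, 1.0),
    "lastName": (string_similarity, 2.0),
}
artist_a = {"firstName": "Johann-Sebastian", "lastName": "Bach"}
artist_b = {"firstName": "Jean-Sebastien", "lastName": "Bach"}
artist_c = {"firstName": "Ludwig", "lastName": "van Beethoven"}

score_ab = resource_similarity(artist_a, artist_b, comparators)
score_ac = resource_similarity(artist_a, artist_c, comparators)
```

A linking tool would then keep the pairs whose aggregated score exceeds a chosen threshold; here the first pair scores higher than the second because the last names are identical and carry the larger weight.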

For example, Jamendo⁵ and MusicBrainz⁶, two data sets containing musicological data, are both
described according to a common music ontology [Raimond et al., 2007]. The artist J.S. Bach can
be identified in both data sets by observing the first name and last name properties of the class Mu-
sicArtist. It is not possible in this case to identify the equivalence of resources based on their URIs.
This example is illustrated in Figure 6.

    URI1 --owl:sameAs (resource matching of data sets described by the same ontology)--> URI2

Figure 5: Interlinking two data sets described according to the same ontology.

³ http://last.fm
⁴ http://dbpedia.org
⁵ http://www.jamendo.com
⁶ http://www.musicbrainz.org



    DBPedia resource:           type mo:MusicArtist, first "Johann-Sebastian", last "Bach"
    MusicBrainz resource (URI2): type mo:MusicArtist, first "Jean-Sebastien", last "Bach"
    (resource matching algorithm, data sets described according to a common ontology)

Figure 6: Example of linking resources described according to the same ontology.



3.4 Datasets described with heterogeneous ontologies 

Datasets can be described by different ontologies. This case is illustrated in Figure 7. In order to know
which types of entities have to be linked together, the system needs to know the correspondences
between these types of entities. Then it can work as if there were a single ontology.

We represent this case in Figure 7 by introducing the correspondences between ontology classes
as an alignment. This alignment is presented as implicit because it does not exist as such, but it is
mixed with the linking specification or the data interlinking system.

Consider two data sets, one described using FOAF, the other using VCard. The linking specifica-
tion will tell the tool to compare entities of type foaf:Person with entities of type vcard:vc,
and that, when comparing resources of these types, the property foaf:givenname should be com-
pared to vcard:fn, and the property foaf:familyname to the property vcard:ln.
This is an implicit alignment containing two correspondences.
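The FOAF/VCard example can be sketched by pulling the correspondences out of the linking logic into plain data. This is an illustration only: the dictionaries, the function names translate and comparable_pairs, and the dictionary-based resource encoding are all assumptions, and the class and property names are simply those of the example above.

```python
# Hypothetical encoding of the correspondences of the FOAF/VCard example.
CLASS_ALIGNMENT = {"foaf:Person": "vcard:vc"}
PROPERTY_ALIGNMENT = {
    "foaf:givenname": "vcard:fn",
    "foaf:familyname": "vcard:ln",
}


def translate(resource, class_alignment, property_alignment):
    """Rewrite a resource description into the target vocabulary.

    Returns None when the resource type has no correspondence; properties
    without a correspondence are dropped.
    """
    target_type = class_alignment.get(resource["type"])
    if target_type is None:
        return None
    translated = {"type": target_type}
    for prop, value in resource.items():
        if prop in property_alignment:
            translated[property_alignment[prop]] = value
    return translated


def comparable_pairs(resource, candidate, class_alignment, property_alignment):
    """Pair up the values a linking tool would compare for two resources."""
    translated = translate(resource, class_alignment, property_alignment)
    if translated is None or translated["type"] != candidate.get("type"):
        return []
    return [(value, candidate[prop])
            for prop, value in translated.items()
            if prop != "type" and prop in candidate]


foaf_bach = {"type": "foaf:Person",
             "foaf:givenname": "Johann Sebastian", "foaf:familyname": "Bach"}
vcard_bach = {"type": "vcard:vc", "vcard:fn": "J.S.", "vcard:ln": "Bach"}
pairs = comparable_pairs(foaf_bach, vcard_bach, CLASS_ALIGNMENT, PROPERTY_ALIGNMENT)
```

Once the correspondences are applied, the value pairs can be handed to the same similarity measures as in the single-ontology case.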
    URI1 --owl:sameAs (resource matching of data sets described by different ontologies)--> URI2
    O1 <-- implicit alignment --> O2

Figure 7: Two data sets interlinked using an implicit alignment.

For example, OpenCyc⁷ represents the artist J.S. Bach using a different ontology than the one
used to describe MusicBrainz. The properties "firstname" and "lastname" correspond to a property
"EnglishID" in which both names are concatenated. The class MusicArtist in the Music Ontology
corresponds to a class Classical Music Composer in OpenCyc. An alignment between classes and
properties needs to be specified in order to find an equivalence between the two resources. This
example is illustrated in Figure 8.



    OpenCyc resource:     type Classical Music Performer, EnglishID "Johann Sebastian Bach"
    MusicBrainz resource: URI1

Figure 8: Example of two data sets described with heterogeneous ontologies.



3.5 Data interlinking with alignments 

Another approach, illustrated in Figure 9, takes advantage of an already existing explicit alignment
between the two ontologies used by the data sets.

An additional possibility, not found in existing systems, would be for the data linking system to
first match the two ontologies before using the resulting alignment for supporting data interlinking.
In such a system, ontology matching and data interlinking would be merged.

Figure 10 unifies all these processes in a single description. This framework clarifies the
interactions between data interlinking and ontology matching.

The next section discusses different systems and their position with respect to the proposed frame- 
work. 



⁷ http://sw.opencyc.org

    URI1 (O1) --owl:sameAs--> URI2 (O2), using an explicit alignment between O1 and O2

Figure 9: Two data sets matched using an explicit alignment.

    URI1 --owl:sameAs (data interlinking)--> URI2
    O1 <-- alignment (ontology matching) --> O2

Figure 10: General framework for data interlinking involving ontology matching.

4 Data interlinking tool analysis 

The work presented in this section is the result of the MeLinDa experiment conducted jointly with
the linked open data mailing list. We asked interlinking tool developers to send us the linking specifi-
cations their tools take as input. We then compared these specifications and evaluated the possibility
of publishing them in a common language⁸. Six systems took part in the experiment. We are aware of at
least two other systems not analyzed in this study [Saïs et al., 2008; Hogan et al., 2007].

We present below different criteria along which the tools can be compared, then we briefly de-
scribe the specifics of each tool and provide a comparison of them along the criteria (Table 2).

4.1 Analysis criteria 

For each analyzed tool, we tried to answer several questions reproduced below. We will then describe 
and categorize each tool according to these questions. 

Degree of automation

- Is the tool completely automatic (a black box)?

- Does the tool need to be parametrized by the user? What kind of parameters (data matching
techniques, ontology alignment)?

Matching techniques used

- String matching?

- External functions (values conversion, data transformations)?

- Similarity propagation?

- Other techniques?

Access

- How does the tool access data (SPARQL endpoint, RDF dump, URL)?

Ontologies

- Does the tool take into account ontologies associated to the data sets?

- Does the tool allow interlinking data sets described according to different ontologies?

- If the ontologies differ, does the tool perform ontology matching?

Output

- What does the tool produce as output (sameAs links, VoiD linkset, other type of links)?

- Does the tool propose to merge the two input data sets?

Domain Is the tool specific to a given domain?

Post-processing Does the tool perform any post-processing operations (consistency checking and
inconsistency resolution)?

⁸ http://melinda.inrialpes.fr MeLinDa stands for Meta-Linking Data.

4.2 Tools 

We considered the following six tools. Table 2 summarizes the analysis.

4.2.1 RKB-CRS 

The co-reference resolution system (CRS) of the RKB knowledge base [Jaffri et al., 2008] is built
around URI equivalence lists. These lists are built using a Java program working on the specific 
domain of universities and conferences. A new program needs to be written for each pair of data sets 
to integrate. Each program consists of the selection of the resources to compare, and their comparison 
using string similarity on the resource property values. 

4.2.2 LD-Mapper 

LD-Mapper [Raimond et al., 2008] is a data set integration tool working on the music domain. This
tool is based on a similarity aggregation algorithm taking into account the similarity of a resource's
neighbors in the graph describing it. It requires little user configuration but only works with data sets
described with the Music Ontology [Raimond et al., 2007]. LD-Mapper is implemented in Prolog.

4.2.3 ODD-Linker

ODD-Linker [Hassanzadeh et al., 2009] is an interlinking tool implemented on top of a tool mining
for equivalent records in relational databases. ODD-Linker uses SQL queries for identifying and com-
paring resources. The tool translates link specifications expressed in LinQL, a dedicated language
originally developed for duplicate record detection in relational databases. Its usage in the context of
linked data is thus limited to relational databases exposed as linked data. LinQL is nonetheless an ex-
pressive formalism for link specifications. The language supports many string matching algorithms,
hyponyms and synonyms, and conditions on attribute values.
Criterion | RKB-CRS | LD-Mapper | ODD-Linker | RDF-AI | Silk | Knofuss
Ontologies | multiple | multiple | single | single | single | multiple
Automation | semi-automatic | automatic | semi-automatic | semi-automatic | semi-automatic | semi-automatic
User input | program | none | link spec., query | data set structure, alignment method | links spec., alignment method | fusion onto.
Input format | Java | Prolog | LinQL | XML | Silk-LSL (XML) | OWL
Matching techniques | string | string, similarity propagation | string | string, WordNet | string | string, adaptive learning
Onto. alignment | no | no | no | no | no | yes, in input
Output | owl:sameAs | owl:sameAs, linkset | linkset | alignment format, merged data set | alignment format, linkset | alignment format, merged data set
Data access | API | local copy | ODBC | local copy | SPARQL | local copy
Domain | publications | Music Ontology | independent | independent | independent | independent
Post-processing | no | no | no | no | no | inconsistency resolution

Table 2: Comparison of data linking tools.
4.2.4 RDF-AI

RDF-AI [Scharffe et al., 2009] is an architecture for data set matching and fusion. It generates an
alignment that can be later used either to generate a linkset, or to merge two data sets. The interlinking 
parameters of RDF-AI are given in a set of XML files corresponding to the different steps of the 
process. The data set structure and the resources to match are described in two files. This description 
corresponds to a small ontology containing only resources of interest and the properties to use in 
the matching process. A pre-processing file describes operations to perform on resources before 
matching. Translation of properties and name reordering are performed before looking for links. 
A matching configuration file describes which techniques should be used for which resources. A 
threshold for generating the linkset from the alignment can be specified. Additionally, when data sets 
need to be merged, a configuration file describes the fusion method to use. The prototype works with 
a local copy of the data sets and is implemented in Java. 

4.2.5 Silk

Silk [Bizer et al., 2009] is an interlinking tool parametrized by a link specification language: the Silk
Link Specification Language (Silk LSL, see Section 7). The user specifies the type of resources to link
and the comparison techniques to use. Datasets are referenced by giving the URI of the SPARQL 
endpoint from which they are accessible. A named graph can be specified in order to link only 
resources belonging to this graph. Resources to be linked are specified using their type, or the RDF 
path to access them. Silk uses many string comparison techniques, numerical and date similarity 
measures, concept distances in a taxonomy, and set similarities. A condition allows for specifying 
the matching algorithm used to match resources. Matching algorithms can be combined using a set 
of operators (MAX, MIN, AVG) and literals can be transformed before the comparison by specifying 
a transformation function, concatenating or splitting resources. Regular expressions can be used
to preprocess resources. Silk takes as input two web data sets accessible behind a SPARQL endpoint. 
When resources are matched with a confidence over a given threshold, the tool outputs sameAs links 
or any other RDF predicate specified by the user. The first version of Silk was implemented in Python; 
version 2 is a new implementation in Scala. 

4.2.6 Knofuss 

The Knofuss architecture [Nikolov et al., 2008] aims at providing support for data set fusion. A
specificity of Knofuss is the possibility to use existing ontology alignments. The resource comparison
process is driven by a dedicated ontology for each data set specifying resources to compare, as well
as the comparison techniques to use. Each ontology gives, for each type of resource to be matched,
an application context defined as a SPARQL query for this type of resource. An object context model
is also defined to specify properties that will be used to match these resource types. Corresponding
application contexts are given the same ID in the two ontologies and one application context indicates
which similarity metric should be used for comparing them. When the two data sets are described
using different ontologies, an ontology alignment can be specified. This alignment is given in the
ontology alignment format [David et al., 2011]. Knofuss allows for exporting links between data
sets, but was originally designed to merge equivalent resources. It includes a consistency resolution
module which ensures that the data set resulting from the fusion of the two data sets is consistent
with respect to the ontologies. The parameters of the fusion operation are also given in the input
ontologies. Knofuss works with local copies of the data sets and is implemented in Java.

An analysis of these tools according to the criteria of Section 4.1 is summarized in Table 2.
Obviously there is a lot of variation between these tools in spite of their common goal. Even if they
are very diverse, each of these data interlinking tools fits in the proposed framework, as shown in
Table 3.



Case | Tool
Manual link specification (§3.1) | -
URI correspondence (§3.2) | RKB-CRS
Common ontology (§3.3) | LD-Mapper, ODD-Linker
Different ontologies, implicit alignment (§3.4) | RDF-AI, Silk
Different ontologies, explicit alignment (§3.5) | Knofuss

Table 3: Classification of analyzed tools with regard to the framework.



The goal of the next section is to consider how using ontology alignments could lead to more 
automation for the interlinking task, as well as how linked data could provide evidence for obtaining 
better ontology alignments. 

5 Matching/linking cooperation 

Although ontology matching and data interlinking can be similar at a certain level (they both relate 
formal entities), there are important differences: one acts at the schema level and the other at the 
instance level. In fact, ontology matching can take advantage of linked data as an external source 
of information for ontology matching, and, conversely, data interlinking can benefit from ontology 
matching by using correspondences to focus the search for potential instance level links. 
These differences are reflected in the types of specification involved in these processes: 

- A link, e.g., a sameAs statement, tells which city in Wikipedia corresponds to which P (place) 
in Geonames, e.g., Manchester sameAs Manchester. 

- A linking specification tells how to find such links, e.g., for linking a City to a P, evaluate 
how close the label of the first one is to the name of the second one with some measure, 
e.g., jaroSimilarity, evaluate how close the populationTotal of the first one is to the 
population of the second one with another measure, e.g., numSimilarity, average the two 
values and, if the result is above 0.9, generate the sameAs statement. 

- An ontology alignment tells which components of one ontology correspond to which components 
in the other. For example, dbpedia:City is a kind of geonames:P and, in this context, 
label is equivalent to name and populationTotal is equivalent to population. 
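
The linking specification described in the second item can be sketched in a few lines of Python. This is only an illustration under stated assumptions: jaro_similarity is the standard Jaro string similarity, while num_similarity (one minus the relative difference) is an assumed definition, since the text does not define numSimilarity. The dictionary keys and the population figures are hypothetical.

```python
def jaro_similarity(s1: str, s2: str) -> float:
    """Standard Jaro similarity, in [0, 1]."""
    if s1 == s2:
        return 1.0
    len1, len2 = len(s1), len(s2)
    if len1 == 0 or len2 == 0:
        return 0.0
    # Characters match only if they are equal and not too far apart.
    window = max(max(len1, len2) // 2 - 1, 0)
    matched1 = [False] * len1
    matched2 = [False] * len2
    matches = 0
    for i, ch in enumerate(s1):
        for j in range(max(0, i - window), min(len2, i + window + 1)):
            if not matched2[j] and s2[j] == ch:
                matched1[i] = matched2[j] = True
                matches += 1
                break
    if matches == 0:
        return 0.0
    # Count matched characters that appear in a different order.
    k = 0
    out_of_order = 0
    for i in range(len1):
        if matched1[i]:
            while not matched2[k]:
                k += 1
            if s1[i] != s2[k]:
                out_of_order += 1
            k += 1
    t = out_of_order / 2
    return (matches / len1 + matches / len2 + (matches - t) / matches) / 3


def num_similarity(a: float, b: float) -> float:
    """Assumed definition: 1 minus the relative difference, clamped to [0, 1]."""
    if a == b:
        return 1.0
    return max(0.0, 1.0 - abs(a - b) / max(abs(a), abs(b)))


def should_link(city: dict, place: dict, threshold: float = 0.9) -> bool:
    """Average the two similarities and link when the result is above the threshold."""
    score = (jaro_similarity(city["label"], place["name"])
             + num_similarity(city["populationTotal"], place["population"])) / 2
    return score > threshold


# Hypothetical records; the population values are illustrative, not real figures.
city = {"label": "Manchester", "populationTotal": 500000}
place = {"name": "Manchester", "population": 510000}
if should_link(city, place):
    print("Manchester sameAs Manchester")
```

The link itself is only the last line of output; everything above it plays the role of the linking specification.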

This results in two process specifications - interlinking and matching - and their results - linksets 
between data and alignments between ontologies. The situation is summarized in Table 4. 





            process                  result
instance    linking specification    linkset
class       matcher                  alignment

Table 4: Interlinking and matching processes and their results. 

By clearly establishing these differences, we obtain a natural partitioning between data links, 
linking specifications and ontology alignments and the languages for expressing them: 

- The assertion expression language (e.g., RDF and VoiD) allows for representing equivalence 
between resources in data sets; 

- The linking specification language (e.g., Silk, LinQL) allows for defining how to search for 
equivalence between resources; 

- The alignment representation language (e.g., the Alignment format or EDOAL) allows for 
specifying equivalence rules between ontological entities. 

It would be useful to take advantage of the framework of Section 3 to help tools interoperate. This 
would present many advantages, in particular the possibility to share, distribute and improve linking 
specifications, as well as to reuse or extend them instead of computing them again whenever a 
data set is modified. This would also allow composing linking specifications, so that it would be 
possible to go from one data set to another without going through an intermediary. 

We propose a scheme under which it is possible for data linking tools to take ontology 
alignments as a way to constrain their solution space. Figure 10 provides a natural way to implement 
this collaboration. 

We first present an expressive language for ontology alignments that can be exploited by data 
interlinking systems (Section 6), and briefly introduce the linking specification language Silk-LSL 
(Section 7). Then we show how they could fruitfully be combined for data interlinking (Section 8). 

6 EDOAL: an expressive ontology alignment language 

EDOAL (Expressive Declarative Ontology Alignment Language) is the new name of the OMWG 
mapping language for expressing ontology alignments [Euzenat et al., 2007b]; it has been available 
through the Alignment API since version 3.1. This language is an extension of the Alignment format 
[Euzenat, 2004], which can be generated by most matchers. Its main purpose is to offer more 
expressiveness in the way alignments are expressed. It has the advantage of being declarative and of 
being able to specify transformations like those needed to construct links between resources. 

A first advantage of the expressiveness of EDOAL is the possibility to express correspondences 
between non-named entities. For instance, a simple assertion such as "a pianist is a musician who 
plays piano" can be expressed by (Figure 11): 

    :dbp-mo a align:Alignment;
        align:onto1 <http://dbpedia.org/ontology/>;
        align:onto2 <http://www.musicontology.com/>;
        align:map [ :map1 a align:Cell;
            align:entity1 dbp:Pianist;
            align:entity2 [ a edoal:Class;
                edoal:and mo:MusicArtist;
                edoal:and [ a edoal:PropertyValueConstraint;
                    edoal:property mo:instrument;
                    edoal:value mo:Piano.
                ].
            ];
            align:relation align:equivalent;
        ].

This can help restrict the search space of data interlinking tools far beyond what they currently 
do (named classes). 



[Figure: Pianist related to the class of MusicArtist resources whose instrument is Piano.]

Figure 11: Correspondence between non-named resources. 
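
How such a class expression can narrow the search space may be sketched as follows. This is a minimal illustration, assuming that resources have already been loaded as plain Python dictionaries; the dictionary layout and the example URIs are hypothetical, and a real tool would more likely translate the expression into a query against the data set.

```python
def matches_pianist_expression(resource: dict) -> bool:
    """True if the resource is a mo:MusicArtist whose mo:instrument includes mo:Piano."""
    return ("mo:MusicArtist" in resource.get("types", ())
            and "mo:Piano" in resource.get("mo:instrument", ()))


def candidates(resources: list) -> list:
    """Keep only the URIs of resources that can correspond to a dbp:Pianist."""
    return [r["uri"] for r in resources if matches_pianist_expression(r)]


# Hypothetical target data set.
resources = [
    {"uri": "ex:artist1", "types": {"mo:MusicArtist"}, "mo:instrument": {"mo:Piano"}},
    {"uri": "ex:artist2", "types": {"mo:MusicArtist"}, "mo:instrument": {"mo:Violin"}},
    {"uri": "ex:venue1", "types": {"mo:Venue"}},
]
print(candidates(resources))
```

Only the resources satisfying the whole expression remain as candidates, which is a smaller set than the one obtained from the named class mo:MusicArtist alone.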
In addition, in EDOAL, it is possible to express that two classes are equivalent, and that their 
instances are equivalent modulo a transformation. This can be used for covering, without further 
information, the URI correspondence case of the framework (Section 3.2). For instance, Figure 12 
shows an EDOAL correspondence using regular expression transformations for identifying musician 
instances between two data sets with different conventions. 

[Figure: Classical Music Performer related to MusicArtist, with a regular expression 
transformation between the rdf:id URI patterns of their instances.]

Figure 12: Expression of a resource equivalence represented in an expressive ontology alignment 
language. 
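
The kind of regular expression transformation discussed above can be sketched in Python. The two URI conventions below are hypothetical and chosen only for illustration (they are not the patterns of Figure 12): the source names resources First_Last.rdf and the target names them Last+First/.

```python
import re

# Hypothetical URI conventions, chosen only for illustration.
SOURCE_PATTERN = re.compile(
    r"^http://source\.example\.org/artist/([^_]+)_([^.]+)\.rdf$")
TARGET_TEMPLATE = r"http://target.example.org/performer/\2+\1/"


def rewrite_uri(source_uri: str):
    """Return the target URI for a source URI, or None if the pattern does not apply."""
    match = SOURCE_PATTERN.match(source_uri)
    if match is None:
        return None
    return match.expand(TARGET_TEMPLATE)


def generate_links(source_uris):
    """Produce (source, owl:sameAs, target) triples for every URI the pattern covers."""
    links = []
    for uri in source_uris:
        target = rewrite_uri(uri)
        if target is not None:
            links.append((uri, "owl:sameAs", target))
    return links


for link in generate_links(["http://source.example.org/artist/Glenn_Gould.rdf"]):
    print(link)
```

No similarity measure or threshold is involved here: the transformation either applies or it does not, which is why this case needs no further information.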

Of course, this can only work when such correspondences exist, i.e., when there is an exact method 
for generating links. Most of the time, data interlinking systems still need to use heuristics to find 
links between entities. The simple Alignment format can support this, but EDOAL can do more by 
indicating where to look to establish the correspondence. 

In particular, EDOAL allows for expressing contextual relations between elements. For instance, 
the typical example in the Silk documentation is the linking of DBpedia cities and Geonames P(laces) 
by comparing their names and populations. Expressing this with the simple alignment: 

    :dbp-geo a align:Alignment;
        align:onto1 <http://dbpedia.org/ontology/>;
        align:onto2 <http://www.geonames.org/ontology#>;
        align:map [ :map1 a align:Cell;
            align:entity1 dbpedia:City;
            align:entity2 gn:P;
            align:relation align:subsumedBy.
        ];
        align:map [ :map2 a align:Cell;
            align:entity1 dbpedia:populationTotal;
            align:entity2 gn:population;
            align:relation align:equivalent.
        ];
        align:map [ :map3 a align:Cell;
            align:entity1 rdfs:label;
            align:entity2 gn:name;
            align:relation align:equivalent.
        ].

does not express the expected meaning because, of course, rdfs:label is not equivalent to gn:name. 
One could consider expressing that gn:name is more specific than rdfs:label. This is correct but 
still not precise enough. The intended meaning is that, in the context of dbpedia:City and gn:P, 
these two properties are equivalent. This is what EDOAL can express through the schema of 
Figure 13, corresponding to the following alignment: 

    :dbp-geo a align:Alignment;
        align:onto1 <http://dbpedia.org/ontology/>;
        align:onto2 <http://www.geonames.org/ontology#>;
        align:map [ :map1 a align:Cell;
            align:entity1 dbpedia:City;
            align:entity2 gn:P;
            align:relation align:subsumedBy.
        ];
        align:map [ :map2 a align:Cell;
            align:entity1 [ a align:Property;
                edoal:and dbpedia:populationTotal;
                edoal:and [ a edoal:PropertyDomainRestriction;
                    edoal:domain dbpedia:City. ];
            ];
            align:entity2 [ a align:Property;
                edoal:and gn:population;
                edoal:and [ a edoal:PropertyDomainRestriction;
                    edoal:domain gn:P. ];
            ];
            align:relation align:equivalent.
        ];
        align:map [ :map3 a align:Cell;
            align:entity1 [ a align:Property;
                edoal:and rdfs:label;
                edoal:and [ a edoal:PropertyDomainRestriction;
                    edoal:domain dbpedia:City. ];
            ];
            align:entity2 [ a align:Property;
                edoal:and gn:name;
                edoal:and [ a edoal:PropertyDomainRestriction;
                    edoal:domain gn:P. ];
            ];
            align:relation align:equivalent.
        ].



[Figure: City related to P; within this context, rdfs:label related to name and 
populationTotal related to population.]

Figure 13: Contextual matching of two classes and their properties. 

Even if such an alignment would provide information to data interlinking tools, this is still not 
sufficient. Of course, it tells which properties should be equivalent and thus can be used for identifying 
entities. But it does not tell how to take them into account. So, this alignment would be sufficient to 
link entities if the values of rdfs:label were exactly the same as those of gn:name and the values 
of populationTotal were exactly the same as those of population, but not otherwise. 

EDOAL provides more features for transforming this information, as we have seen in 
Figure 12. This could be helpful, but the problem is deeper: data interlinking is a decision problem rather 
than just a transformation. It is the role of the data linking specification to tell when a particular 
dbpedia:City and a gn:P should be considered the same. This is why we propose to use data 
interlinking specifications together with alignments. 



7 Silk-LSL: a linking specification language 



Below is the Silk-LSL [Bizer et al., 2009] specification to interlink cities in the two data sets DBpedia 
and Geonames: 



    <Silk>
      <Prefix id="rdfs" namespace="http://www.w3.org/2000/01/rdf-schema#" />
      <Prefix id="dbpedia" namespace="http://dbpedia.org/ontology/" />
      <Prefix id="gn" namespace="http://www.geonames.org/ontology#" />

      <DataSource id="dbpedia">
        <EndpointURI>http://demo_sparql_server1/sparql</EndpointURI>
        <Graph>http://dbpedia.org</Graph>
      </DataSource>

      <DataSource id="geonames">
        <EndpointURI>http://demo_sparql_server2/sparql</EndpointURI>
        <Graph>http://sws.geonames.org/</Graph>
      </DataSource>

      <Interlink id="cities">
        <LinkType>owl:sameAs</LinkType>

        <SourceDataset dataSource="dbpedia" var="a">
          <RestrictTo>
            ?a rdf:type dbpedia:City
          </RestrictTo>
        </SourceDataset>

        <TargetDataset dataSource="geonames" var="b">
          <RestrictTo>
            ?b rdf:type gn:P
          </RestrictTo>
        </TargetDataset>

        <LinkCondition>
          <AVG>
            <Compare metric="jaroSimilarity">
              <Param name="str1" path="?a/rdfs:label" />
              <Param name="str2" path="?b/gn:name" />
            </Compare>
            <Compare metric="numSimilarity">
              <Param name="num1" path="?a/dbpedia:populationTotal" />
              <Param name="num2" path="?b/gn:population" />
            </Compare>
          </AVG>
        </LinkCondition>

        <Thresholds accept="0.9" verify="0.7" />
        <Output acceptedLinks="accepted_links.n3"
                verifyLinks="verify_links.n3"
                mode="truncate" />
      </Interlink>
    </Silk>

This specification fulfills two roles: 

- It is an alignment: it specifies the classes in which entities to link can be found. Restrictions 
to dbpedia:City and gn:P are in fact an alignment between these two concepts. Similarly, 
the compared properties, populationTotal and population on the one hand and rdfs:label and 
name on the other, provide the correspondences between properties. 

- It specifies how to link entities. Indeed, what Silk brings in addition is the specification of 
how to decide whether two entities should be linked: when the average (AVG) of their respective 
similarities (Compare) is over a threshold (Thresholds; there are two thresholds, one for 
automatically accepting the equivalence and one for drawing the attention of a user). 
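
The role of the two thresholds can be illustrated with a short Python sketch. It assumes that an aggregated score has already been computed for each candidate pair, and it treats both thresholds as strict lower bounds; how Silk handles a score exactly equal to a threshold is not stated here, so that choice is an assumption, as are the function names.

```python
def classify(score: float, accept: float = 0.9, verify: float = 0.7) -> str:
    """Route a candidate link according to its aggregated similarity score."""
    if score > accept:
        return "accept"   # generated automatically
    if score > verify:
        return "verify"   # brought to the attention of a user
    return "reject"


def route_links(scored_pairs, accept: float = 0.9, verify: float = 0.7):
    """Split (source, target, score) triples into accepted links and links to verify."""
    accepted, to_verify = [], []
    for source, target, score in scored_pairs:
        decision = classify(score, accept, verify)
        if decision == "accept":
            accepted.append((source, target))
        elif decision == "verify":
            to_verify.append((source, target))
    return accepted, to_verify


# Hypothetical candidate pairs with already aggregated scores.
pairs = [("a1", "b1", 0.95), ("a2", "b2", 0.8), ("a3", "b3", 0.4)]
accepted, to_verify = route_links(pairs)
print(accepted, to_verify)
```

The two returned lists play the role of the acceptedLinks and verifyLinks outputs of the specification.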

It would be possible to refer to an external alignment between the two underlying ontologies 
instead of specifying it in the linking specification. This approach would present obvious reuse 
advantages when other data sets requiring the same alignment, i.e., using the same ontologies, need to 
be interlinked. 

8 Data interlinking using ontology alignments 

Apart from Knofuss, interlinking tools do not provide the possibility to use an ontology alignment. 
Even Knofuss still requires queries to be specified on both data sets, from whose results equivalent 
resources are identified. 

Indeed, using an explicit alignment, provided that it is expressive enough, can serve two functions: 
1. narrowing the search space by pointing to equivalent concepts, and 

2. providing the properties that can be used for identifying concepts. 

There are two ways to articulate ontology alignment and linking specifications: 

- Transforming an expressive alignment into a linking specification: this requires that the align- 
ment contains as much information as possible and that matchers be able to produce such 
descriptions. This has the advantage that from the alignment, the specification may be trans- 
formed into different linking specification languages. 

- Enabling linking specifications to refer explicitly to alignments and possibly to matchers: 
this requires extending specification languages for that purpose. 

We consider the latter option below. Given that the alignment is available, it is possible to simplify 
the Silk specification and refer to the alignment by introducing three types of information: which 
alignments to use (UseAlignment), the entities of which correspondences must be linked (LinkCell), 
and which matched properties can be compared for identifying entities (CellParam). 

    <UseAlignment rdf:resource="#dbp-geo" />

    <Interlink id="cities">
      <LinkType>owl:sameAs</LinkType>
      <LinkCell rdf:resource="#map1" />

      <LinkCondition>
        <And combiner="AVG">
          <Compare metric="jaroSimilarity">
            <CellParam rdf:resource="#map3" />
          </Compare>
          <Compare metric="numSimilarity">
            <CellParam rdf:resource="#map2" />
          </Compare>
        </And>
      </LinkCondition>

      <Thresholds accept="0.9" verify="0.7" />
      <Output acceptedLinks="accepted_links.n3"
              verifyLinks="verify_links.n3"
              mode="truncate" />
    </Interlink>

The specifics of the data interlinking task remain in this specification: how to compare values, 
how to aggregate their results and when to issue the link or not. 
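
To make the division of labour concrete, the following Python sketch resolves the cell references of such a simplified specification against an alignment and recovers the restrictions and property paths of the full specification of Section 7. It is only an illustration: the dictionary encodings and the function name are hypothetical, and the cell identifiers follow the simple alignment of Section 6 (map1 for the classes, map2 for the population properties, map3 for the label and name properties), so the string metric is paired with map3 and the numeric metric with map2.

```python
# Alignment cells as (entity1, entity2, relation); hypothetical in-memory encoding.
ALIGNMENT = {
    "map1": ("dbpedia:City", "gn:P", "subsumedBy"),
    "map2": ("dbpedia:populationTotal", "gn:population", "equivalent"),
    "map3": ("rdfs:label", "gn:name", "equivalent"),
}

# Simplified linking specification: it only refers to cells of the alignment.
SPEC = {
    "link_cell": "map1",
    "comparisons": [("jaroSimilarity", "map3"), ("numSimilarity", "map2")],
    "thresholds": {"accept": 0.9, "verify": 0.7},
}


def expand(spec: dict, alignment: dict) -> dict:
    """Replace cell references by the classes and properties they designate."""
    source_class, target_class, _ = alignment[spec["link_cell"]]
    comparisons = []
    for metric, cell_id in spec["comparisons"]:
        source_property, target_property, _ = alignment[cell_id]
        comparisons.append({
            "metric": metric,
            "source_path": f"?a/{source_property}",
            "target_path": f"?b/{target_property}",
        })
    return {
        "source_restriction": f"?a rdf:type {source_class}",
        "target_restriction": f"?b rdf:type {target_class}",
        "comparisons": comparisons,
        "thresholds": spec["thresholds"],
    }


expanded = expand(SPEC, ALIGNMENT)
print(expanded["source_restriction"])
print(expanded["comparisons"])
```

Everything that depends on the ontologies comes from the alignment, while the metrics, the aggregation and the thresholds stay in the linking specification.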

In fact, the symbiosis between the alignment and the linking specification can be rendered even 
more automatic, e.g., by defining default rules for comparing values of a given type, default rules for 
aggregating metrics, and default threshold rules. However, it is also useful for the linking 
specification designer to keep control over what the interlinking tool does and, even if a 
correspondence is not in an alignment, to be able to define it. 

This approach presents several advantages: 

1. The linking specification is simplified, reducing the manual input; 

2. There is a clear separation between links, linking specification, and ontology alignments; 

3. The same alignment can be reused for linking any two data sets described according to the 
same ontologies. 

9 Conclusion 

Interlinking data sets becomes an even more important problem as their number quickly increases. In 
order to scale, the interlinking task has to be as automated as possible. 
We have studied various existing data interlinking tools and observed the following: 

- Beyond the variations between these systems, it is possible to define a general framework 
covering the different levels of expressiveness (ranging from a Prolog program to compositions 
of linking specifications). 

- Although there is a relevant similarity with ontology alignment, an ontology alignment 
language is not enough to express linking specifications, particularly because it is not its primary 
goal to identify individual entities. 

We have thus proposed an architecture based on three different languages, each having its own precise 
purpose: expressing links, expressing linking specifications, and expressing ontology alignments. 

This architecture can be used in order to organize a better collaboration between ontology match- 
ers and data interlinking tools. This can be achieved with only minimal extensions to existing lan- 
guages. 

In particular, we have illustrated the ontology alignment part with EDOAL, an expressive ontol- 
ogy alignment language that offers the necessary concepts for being used in data interlinking. On the 
data interlinking side, we have focussed on the Silk-LSL language which seems to be at once declar- 
ative and powerful enough to express a wide range of constraints on data interlinking. Extending it 
with the capacity to benefit from ontology alignments would allow tools using it to benefit from the 
wide range of ontology alignment techniques and tools. 

The domain of interlinking data on the web is quickly expanding. New needs and new techniques 
appear. It is thus important not to stifle innovation with a narrow language. Developing standard 
tools to share linking specifications will greatly improve those techniques. There is still a lot of work to 
do in order to achieve this goal. 